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Abstract 

This article describes a method for using optimization to derive efficient independent tran¬ 
sition functions for Markov chain Monte Carlo (MCMC) simulations with application to 
sampling from a posterior density 7r(x) which is multimodal. Although the proposals in 
the Randomized Maximum Likelihood method are placed in regions of high probability, 
the distribution of samples is only approximately correct. Introduction of auxiliary vari¬ 
ables simplifies the computation of the proposal density, allowing the Metropolis-Hastings 
method to be applied. We restrict our attention to the special case for which the target 
density is the product of a multivariate Gaussian prior and a likelihood function for which 
the errors in observations are additive and Gaussian. 


Introduction 

Many geoscience inverse problems are characterized by nonlinearity in the relationship 
between parameters and observations E3EU26]- In some cases, the posterior pdf for the 
model parameters after assimilation of observations is multimodal, although the presence 
or absence of multiple modes in large models is seldom verified. Examples of smaller 
models in which multiple modes have been verified include flow and transport problems in 
porous media, where the nonlinearity resulting from uncertainty in connections in layered 
media and observation operators that average over layers can both lead to multiple modes 
PET], Monte Carlo sampling is often the only viable method of quantifying uncertainty in 
the posterior distribution for problems of this type, but the cost of obtaining a substantial 
number of independent samples can be prohibitive. 

In order to obtain a reasonably high probability of generating acceptable random- 
walk transitions in Metropolis sampling, the transitions must generally be small: for small 
enough transition distances, almost ah proposals will be accepted. In that case, however, it 
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is possible to remain trapped in some modes for very long times before jumping to another 
mode and the time required to obtain useful Monte Carlo estimates can be impractical 
M- The problem of optimal scaling of the proposal distribution to maximize the efficiency 
of mixing of a random walk Metropolis algorithm has been solved for certain classes of 
target distributions I2H BU- These scaling rules are not applicable in all contexts, and in 
particular are not appropriate for multimodal distributions M- The ability to design a 
proposal distribution that mixes well is key to the efficiency of the MCMC algorithm for 
multimodal distributions. 

In this paper, we discuss an augmented variable independence Metropolis sampler that 
uses minimization to place proposals in regions of high probability. Augmented variable 
methods have been shown to be useful in MCMC to construct efficient sampling algorithms 
and in particular for sampling from multimodal distributions [3], where the auxiliary vari¬ 
ables might allow for large jumps between modes m ■ Storvik [ 33 ] suggests that auxiliary 
variables can be used in Metropolis-Hastings (MH) either for generating better proposals 
or for easier calculation of acceptance probabilities and points out that the use of auxiliary 
variables allows flexibility in choosing the target distribution in the augmented space, as 
long as the marginal distribution for the variables of interest is unchanged. In this paper, 
the augmented variables are useful for obtaining a marginal proposal density that is close 
to the target marginal density, and for simplifying evaluation of the MH ratio. 

Independence Metropolis samplers are Metropolis-Hastings samplers for which the 
transition probability g(y|x) does not depend on the current state of the chain (g(y|x) = 
q(y)) [36]• For independence sampling, it is useful to choose the proposal density q(x ) 
such that vr (x)/q(x) is bounded and as nearly constant as possible, in much the same way 
that optimal proposal densities would be chosen for importance sampling or for rejection 
sampling. Liu [ 20 ] compares the efficiency of the three methods, concluding that the in¬ 
dependence Metropolis sampling is asymptotically comparable to rejection sampling, but 
that independence Metropolis sampling is simpler to implement as it does not require 
knowledge of the envelope constant. 

A number of methods have been developed for MH that use either local gradient 
information to make transitions to regions of high probability [UU22] or to allow better 
exploration of the target density than can be obtained by random-walk sampling mum- 
In several methods, the transition is based on an optimization step. The Multiple-Try 
method m uses local minimization along a line in state space to propose candidate 
states for updating one of the current population of states. Alternatively, Tjelmeland and 
Hegstad [31] described a two-step transition in the Metropolis-Hastings method in which 
mode-jumping transitions alternated with mode-exploring transitions. 

The Randomized Maximum Likelihood (RML) method [25] was developed as an inde¬ 
pendence Metropolis sampler for the special case in which the target density 7 r(x) is the 
product of a multivariate Gaussian prior p(x) and a likelihood function p(d|x) for which 
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the errors in observations d are Gaussian. The prior distribution for X is assumed to be 
multivariate normal with mean /i. and covariance C x . Observations d Q b s = g(x) + with 
~ IV(0, Cd) are assimilated resulting in a posterior density 

tt(x) oc exp (~7>( x - A t ) Tc x 1 ( x - R) - J(g( x ) - d obs ) T C^ 1 (g(x) - d obs )^ . (1) 

Candidate samples in RML are obtained by a two-step process. In the first step, random¬ 
ized samples from the prior distribution for model variables and data variables are gener¬ 
ated. In the second step, the candidate state is obtained by minimization of a stochastic 
objective function. The distribution of candidate states q(x) depends on both the model 
and data variables so computation of the transition probability requires marginalization, 
q(x) = f q(x, d) dd. For problems in which the prior is Gaussian, the observation operator 
is linear, and errors in observations are additive and Gaussian, the proposal density q(x ) 
is equal to the target density vr(x) so that all independently proposed candidate states 
are accepted in Metropolis-Hastings. For problems in which the observation operator is 
nonlinear, proposed candidates are always placed in regions of high probability density, 
but evaluation of the Metropolis-Hastings acceptance test is difficult as it requires evalua¬ 
tion of the marginal density for model variables. Consequently, in practice the MH test is 
ignored for RML in large geoscience inverse problems USEE]. Bardsley et al. [2] showed 
that for non-linear problems in which the Jacobian of the observation operator is differen¬ 
tiable and monotonic, the MH acceptance test could be based on a Jacobian evaluated at 
the maximizer of Eq. [lj In this case, the probability of candidate states could be relatively 
easily computed and sampling was shown to be correct for several test problems. 

Here we describe a modified Metropolization approach in which the randomized data 
are included in the state vector. The proposal of model variables is accomplished via a 
local optimization, but calibrated data variables are retained in the state space and used 
for evaluation of the probability of acceptance of the state. By doing this, the method 
allows sampling from multimodal distributions and the need to compute the marginal 
density for model variables is avoided. Correct sampling requires determination of the 
distribution of the proposals which is easier when the state space is augmented with data 
variables. 

The paper is organized as follows. In Section [I] we summarize the original RML algo¬ 
rithm as an independence Metropolis sampler for which only model variables are included 
in the state of the Markov chain and discuss a pseudo-marginal MCMC approach to the 
sampling problem. In Section [2] we describe an augmented state version of the RML inde¬ 
pendence Metropolis sampler. For this version, computation of the Metropolis-Hastings 
ratio is simpler as it only requires the Jacobian of the transformation of the augmented 
state variables. Section [3] provides several examples for validation of the algorithm. 
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1 RML for model variables 


In both the quasi-linear estimation method m and the randomized maximum likelihood 
method |25j, samples are generated in high probability regions of the posterior pdf by the 
simple expedient of simultaneously minimizing the magnitude of misfit of the model with 
perturbed observations and the magnitude of the change in the model variables from an 
unconditional sample. Unconditional realizations of models variables (x uc ) are sampled 
from the prior and realizations of observation (d uc ) are sampled from the distribution of 
model noise, 

x uc ~ N[n, C x ] and d uc ~ IV[d obs , C d ], 


A sample from a high probability region is generated by minimizing a nonlinear least- 
squares cost function: 


x* = argmin 

X 


7 ,(x - x uc ) T C x x (x - x uc ) + ~(g(x) - d U c) T C d 1 (g(x) - du 


( 2 ) 


Although the samples generated in this way tend to be located in parts of parameter 
space with high posterior probability, the samples are not necessarily distributed according 
to the posterior distribution. The distribution of RML samples can, however, be computed 
if the inverse transformation from the calibrated samples to the prior samples is known. 
If the objective function (Eq. [2| is differentiable, then a necessary (but not sufficient) 
condition for x* to be a minimizer is that 


x uc = x* + C x G T C/(g(x*) - d uc ) (3) 

where G T = V^g. Eq. [^provides the inverse transformation for the unconditional samples, 
restricted to the region of (x*, d uc ) for which x* is a minimizer of Eq. [2j 
The joint distribution for x uc and d uc is 

/(x uc ,d uc ) oc exp^ — (x uc /r.) C x (x uc /i) — (d uc d obs ) (d uc d obs )^ 

and hence the joint probability of proposing (x*,d uc ) is 


<?(x*, d uc ) — /(x uc (x*,d uc ),d uc )|J| (4) 

where J is the Jacobian of the inverse transformation from (x*,d*) to (x uc ,d uc ). By 
choosing d* = d uc , the Jacobian determinant requires only computation of 

I ji_ <9(x uc , d U c) 

<9(x*) 

The marginal probability of proposing x* is then obtained by integrating g(x*,d uc ) over 
the data space, 

g(x*) — / /(x U c(x*, d^), due) |J(x*, d uc )| dd uc . (5) 

Jd 
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For an independence sampler in the Metropolis-Hastings algorithm [36], the probability 
of proposing a transition to state x* is independent of the current state x, then the proposed 
state x* is accepted with probability 


a(x, x*) 


min 


( 7r(x^)q(x) \ 

V ’ *(x)q(x*)J 


( 6 ) 


The probability density for the proposed model, 7r(x*), is computed from Eq. [I] Note that 
the probability 7r is not based on the quality of the match to the perturbed data obtained 
in the minimization, but on the quality of the match to the prior model and the actual 
observed data. Algorithm [I] describes the method for using RML as a proposal mechanism 
in a Metropolis-Hastings procedure [25]. 


Algorithm 1 RML for model variables 


1: Generate initial state xo N[n,C a 
2: for i < i max do 


3: 

4: 

5: 

6 : 

7: 

8 : 

9: 


procedure Generate candidate state 

Generate x uc ~ N[n, C x ] and d nc ~ A[d obs , C rf ] 

Compute 

x* = argmin x g(x - x uc ) T C“ 1 (x - x uc ) + 7j(g(x) - d uc ) T C^ 1 (g(x) - d uc ) 

end procedure 

Compute proposal density <?(x*) = J D g(x*, d uc ) dd uc using Eq. [d] for g(x*,d uc ). 
Compute a(xi,x*) = min (l, 


10: Generate u from U( 0,1) 

11: if u < a(xj,x*) then 


12: Xj_|_i 4— x* 

13: else 


14: Xj + i •(- X,; 

15: end if 

16: end for 


When the relationship of model variables to observations is linear and observation 
errors are additive and Gaussian, it is straightforward to show that the minimizers of 
Eq. [ 2 ] are distributed as the posterior distribution for x [3123] and hence all proposed 
transitions are accepted when Algorithm [T] is used for Gauss-linear inverse problems. 

When Algorithm[l]was applied to a problem for which the posterior distribution was bi- 
modal (as in Example 1, below), the acceptance rate for proposed independent transitions 
was still very high (76%) [25j. Although the method was shown to sample correctly, the 
work required for computing the marginal distribution for the candidate states made ap¬ 
plication of the full method impractical in real problems. Consequently, when the method 
has been applied to large problems, the MH acceptance test was omitted [6J El US], with 
the understanding that the sampling method is then only approximately correct. 
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The need to numerically compute the marginal density q(x*) from g(x*,d uc ) in Algo¬ 
rithm 1 could potentially be avoided through application of the pseudo-marginal MCMC 
method mm, which only requires an unbiased estimate of the marginal distribution. 
When the pseudo-marginal MCMC method as described in Algorithm 9.7 of [29J was ap¬ 
plied to Example 1 (Section |3.1[ ) using a using a single unbiased sample at each step, 
however, sampling from a Markov chain of length 10' 5 was very poor. Results would al¬ 
most certainly be improved by increasing the sample size used to represent the marginal 
distribution [32], but because of the dependence of x* on d nc in the RML method, it is 
not feasible to draw multiple d uc for a given x*. 


2 Augmented state RML 

Because the major challenge with the use of Metropolized RML as in Algorithm [l] is the 
computation of the marginal probability of proposing the candidate model variables, we 
introduce a modification that avoids the need for computation of the marginal proposal 
density by augmenting the state x* with suitably defined data variables d*. Although 
auxiliary variables can improve sampling in several ways [33], the purpose of including 
auxiliary variables in this case is primarily to simplify computation of the MH acceptance 
criterion. 


2.1 Posterior (target) distribution 


Consider the joint space of model variables and data in the case for which the obser¬ 
vation errors and the modelization errors can both be modeled as Gaussian. The prior 
distribution of model variables is assumed Gaussian with mean /i. and covariance C x ; the 
prior distribution of observations is assumed to be Gaussian with mean d obs and covari¬ 
ance (1 — 7 )Crf; modelization errors are Gaussian with mean 0 and covariance 7 C^. The 
posterior joint probability of (x, d), obtained by combining states of information is 


vr(x,d) ocexp -^(x - n) T C x x (x - p) - ^-(g(x) - d) T C d 1 (g(x) - d) 


1 


2(1 _ 7 ) (d " d obs ) C^ 1 (d - d obs )J. (7) 

After some rearrangement of terms, the joint probability can be factored into the product 
of the marginal posterior density for model variable x and the posterior density for d 
conditional to x, 

vr(x,d) ocexp -^(x - /i) T C" 1 (x - /z) - ^(g(x) - d obs ) T C" 1 (g(x) - d obs ) 


exp 


1 


9 7 (1 _ ( d “ g( x ) + 7(g( x ) - d 0 bs)) T C d 1 (d - g(x) + 7 (g(x) - d obs ))J. (8) 

Although the introduction of Gaussian modelization error does not affect the posterior 
marginal distribution of x as long as it is compensated for by a reduction in observation 
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error, the posterior joint probability of (x, d) does depend strongly on the value of 7 as 
discussed in Section [2~3l on the selection of 7 . 


2.2 Proposal distribution 

As in the original version of RML, we draw samples from the normal distributions x uc ~ 
N[fj,,C x \ and d uc ~ JV[d 0 b s , CJ and denote the joint distribution as p(-,-). Candidate 
transitions in a MH algorithm are obtained by minimizing a nonlinear least squares cost 
function 


(x*,d*) = argmin [7 (x - x uc ) T C 3 . 1 (x - x uc ) + -*-(g(x) - d) T C d 1 (g(x) - d) 

x.d 


+ 


1 


(d d uc ) T CT 1 (d d uc ) . (9) 


2(1 -p) 

Implicit expressions for the relationship between (x*,d*) and (x uc ,d uc ) are derived from 
requirement that the gradient at the minimum must vanish. At the minimum, the following 
relationships hold: 

. (10) 


and 


x* - x uc + -C x .G i C d i (g(x*) - d*) = 0 

p 


d* - />d uc - (1 - p)g(x*) = 0. 


(ID 


Using Eq. 11 to eliminate d* from Eq. [10] gives 


x* x uc G X G c d (g(x*) d uc ) — 0 


( 12 ) 


which shows that the marginal distribution of x* is independent of p. For Gauss-linear 
problems, x* is distributed according to the posterior marginal distribution of x. 


The inverse mapping, obtained from Eqs. 10 and 11 


Xuc 


x* + lC I ,G T C d 1 (g(x*) - d*) 

due 


l d *~ ( 1 7 £ )g( x *) 


(13) 


is invertible if the domain of the mapping is restricted to the range of the mapping given 
by Eq. which we denote by T. The distribution of candidate states, is then 


g(x*,d*) = 


P (x uc (x*, d*), d uc (x*, d*)) |J(x*,d*)| if (x*,d*) £ T 
0 if (x*, d*) 0 T 


(14) 


where x uc (x*, d*) and d uc (x*, d*) are dehned in Eq. 13 and the dependence of the expres¬ 
sions on p has been suppressed for clarity. 

Because the state is composed of both model and data variables, the Jacobian matrix 
takes a block form with 


wt = r* + 1 £ £ £ c ? [g‘ 7 [c,- 1 ] w ^ [c pf ort*,) - a*)] 

U X * P ry i j L 
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and 


G ia = 


dg l 

dx a 


Other derivatives required for the Jacobian can be compactly written in matrix notation. 


<9x„ 


1 


<9d* 

dd uc 

<9x* 


= --C X G T C7\ 


1 ~ P 
P 


and 


= -I. 
P 


ddnc 
dd* 

For a linear Gaussian inverse problem, the relationship between the proposed states 
and the unconditional states is 

_1 C,.G T (GC,.G T + C d )“ 1 

1 (gc,g t + P c d ) (gc x g t + c d )- 1 


- 1 

X 

* 


* 1 

-1 

1 

1- 


C d (GC x G T + C d y 
(l-p)G T C d (GC x G T + C d y 


= A 


X UC 

diir; 


The covariance of the proposed states (x*,d*) is 


Gpost — A 


C x 

0 


C x > 

GC, 


0 

c^g t 

GC x ,G t + p 2 C d A- D l C d 


where C x / = (C“ 1 + G t C^' 1 G) and Ap = (GC J: G T + C(i). Note that for p = 0, 
the covariance of proposed states is the same as the posteriori covariance for the joint 
model-data space of Tarantola [34] , although the covariance is then rank deficient. 


The joint density given by Eq. 14 can be used as the proposal density in an inde¬ 
pendence MH sampler for sampling of the joint posterior distribution (Eq. [Tj) with the 
objective of sampling from the posterior marginal distribution for model variables (Eq. [I]). 
Algorithm [2] describes the use of minimization to provide candidate states. 

2.3 Selection of p and 7 

The efficiency of the independence Metropolis sampler depends on two factors. The first 


is requirement that the proposal distribution q (Eq. 14) be close to the target density ir 

The 


(Eq. [7]) so that the ratio 7r(x*, d*)/g(x*, d*) is as close to constant as possible 
second requirement is loosely that the proposal density be positive wherever the target 
density is positive so that the Markov chain is irreducible. 





















Algorithm 2 Augmented state RML for model-data variables 


1: Generate xo ~ N[fi, C x ] and do ~ lV[d obs , CJ 
2: for i < i max do 


3: 

4: 

5: 

6 : 

7: 

8 : 

9: 

10 : 

11 : 

12 : 

13: 

14: 

15: 

16: 


procedure Generate candidate state 

Generate x uc ~ N[n, C x ] and d uc ~ IV[d obs , C d ] 

Compute 

(x*, d*) = argmin x d g(x - x uc ) T C“ 1 (x - x uc ) 

+ 4(g( x ) “ d ) Tc d 1 (g( x ) - d) + 2(I^y( d - d uc) T C^ 1 (d 

end procedure 


Compute proposal density g(x*,d*) using Eq. 14 

U)gl 

i)g( 


Compute a(xi, d u x*, d*) = min (l, 


Generate u from U{ 0,1) 
if u < o(xj,dj,x*,d*) then 
Xj + i «— x* and d; + i <- d* 
else 

Xj+i Xj and d i+ i <r- d* 

end if 


17: end for 



It is convenient to compare the proposal densities and the target densities based on 
the distribution of d given x. From Eq. [8j the mode of the target density for d is located 
at 

d = g( x ) - 7(g( x ) - d obs ) (15) 

and the covariance of d given x in the target distribution is 7(1 — 7 )C^. Similarly, it 


is possible to factor the proposal density Eq. [14] as the product of the target density for 
x, a Gaussian distribution for d* given x*, term that depends on d*, and the Jacobian 
determinant, 


<?(x*, d*) oc exp 

x exp 


( x * - vf C x 1 ( x * - A*) - \ (g( x *) - d obs) T C d 1 (g(x*) - d obs ) 


1 

Tp 


2 (d* - g(x*) + pV 1 ? 7 (x*)) V (d* — g(x*) + p\ 1 T/( x *)) 


x exp 


x ? ?(x*) T V V x *) 


| J (x*, d* 


where 


and 


v = c,- 1 + c- 1 gc,g t c- 1 


??( x *) = C d 1 [(Gx* - g(x*)) - (G/r - d obs )]. 


(16) 

(17) 

(18) 


9 

















Note that when the observation operator g is linear, 77 , V, and J are all independent of 
x* and d*. 

For linear observations, both the target density and the proposal density are Gaussian 
in which case it is sufficient to compare the inodes and the variances of the two distributions 
for selection of p and 7 . The modes of the target distribution and the mode of the proposal 
distribution in this case are equal when 

-pV _1 77 = y(Gx* - d obs ) 


and the variances of the two distributions are equal when 

/ 0 2 V -1 = 7 (1 - 7 )C d . 

For linear g , it is therefore not possible to satisfy both relationships exactly, except by 
choosing p = 7 = 0, in which case the distributions are degenerate. Figure [l] compares 
the joint distributions for several values of 7 and p in the case of a linear observation. It 
is clear that the two distributions will only be similar if 7 is small. 




(a) Joint target pdf for three values of 7 : (b) Joint proposal pdf for three values of p\ 
0.02 (black), 0.5 (dashed), 0.98 (red). 0.05 (black), 0.5 (dashed), 0.95 (red). 


Figure 1: Effect of the values of p and 7 on the joint target and proposal densities for a 
linear observation. 


For nonlinear problems, the mapping (Eq. 13) from (x uc ,d uc ) to (x*,d*) is not invert¬ 
ible unless the domain of the mapping is restricted. In this case, it is necessary to select 
p and 7 in such a way that the proposal density is concentrated in the region where the 
mapping is bijective (and hence the proposal density is positive). Insight can be obtained 
by considering the case for which a single model variable and a single data variable. 

If d 2 g/dx 2 7 ^ 0, then solving for the location of the boundary, we obtain 

-1 


d.=g(x,)+C d (^£j (pCP+G^CpG). 


(19) 
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Increasing the value of p shifts the boundary away from d* = g(x*) in the direction given 
by the sign of the second derivative of g. 


The distance between the mode of the target density (Eq. 15) and the boundary of the 


region of positive proposal density (Eq. 19) is 

72 


A d = 


Cd ( 0 ) {PC - 1 + GT °d lG ) + 7G/(s) - dobs). 


The probability mass outside of the domain of validity is determined also by the variance of 
the distribution. The fraction of the cumulative distribution that is outside of the domain 
of g(x*, d*) is less than e if 




Ad 


Vr(i “7 )Cd. 


> 1 -e. 


Thus for highly nonlinear problems, it is necessary to choose 7 < 1 and p ~ 1 to satisfy 
the requirement that all states with significant probability density can be reached. 





(a) One branch of q{x,d) for (b) One branch of q(x,d) for (c) One branch of n(x,d) for 
p = 0.5. p = 0.999. two values of 7 : 0.001 (black), 

0.5 (red). Dashed blue con¬ 
tour shows boundary of pro¬ 
posal region. 

Figure 2: Effect of the values of p and 7 on the joint target and proposal densities for a 
quadratic observation. 

Figure [2] illustrates the effect of the choice of p on the proposal density for the numerical 


example in Section |3.1| Note that increasing p from 0.5 (Fig. 2a) to 0.999 (Fig. 2b) 
increases the width of the proposal density but more importantly, it shifts the location 
of the boundary of the region with positive proposal density (dashed blue contour) away 


from the region of significant target density (Fig. 2c). 

We note that although the RML method for generating proposals via the minimization 
of a cost function is somewhat inflexible, the target density for (x, d) can be chosen 
much more freely as long as the marginal density for x is correct. In the terminology of 
Papaspiliopoulos et al. [283, we have chosen to use a centered parameterization, but if 
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we had chosen d = d — (1 — 7 )g(x), then the pair (x, d) would have been conditionally 
independent. That parameterization is similar to the choice made by 121 . : 39l . It does not 
alter the central difficulty, however, of ensuring that the target density is approximately 
zero when the proposal density is zero. 


3 Examples 

This section illustrates the characteristics of the Metroplolized RML method on problems 
with multiple modes. In each case, it is possible to compare the distribution of samples 
with the true posterior probability density for a small inverse problem. The first example 
consists of a single model variable and a single data variable so that the joint augmented 
distribution can be visualized easily. The posterior distribution on the model variable 
is bimodal but the probability in the region between modes is relatively large. In the 
second example, the probability mass is located in more than 100 widely separated modes 
and the efficiency of the Metropolized RML method is compared with the preconditioned 
Crank Nicholson method flQ| . The third example illustrates application to a problem with 
nongaussian prior. 


3.1 Example 1: Bimodal 


In this example, the posterior distribution for the model variable is bimodal. A region of 
small but significant probability density spans the region between the peaks. This problem 
was originally used as to test the sampling of RML algorithm [25j where it was shown that 
correct sampling could be attained if the marginal distribution of transition proposals 
(Eq.§ is used in the MH test. Computation of the marginal distribution was relatively 
difficult, however, even for this single-variable problem because the integration had to be 
limited to the region of the joint model-data space in which the mapping was invertible. 
Here we apply the augmented variable RML method (Algorithm [2j) on the joint model- 
data space, without need for computation of the marginal distribution of proposals. The 
parameter 7 that determines the relative contribution of modelization error vs observation 
error in the posterior is set at a small nonzero value ( 0 . 01 ) to prevent degeneracy of the 
posterior distribution. 

We consider the problem of sampling from the following univariate distribution 

{x-g) 2 (g(x) - dobs ) 2 " 1 


7t(.t) = a exp 


2a 2 


2*3 


( 20 ) 


where g = 1.9, d 0 b s = 0.8, a 2 = 0.1, = 0.01, g{x ) = 1 — 9(x — 27r/3) 2 /2, and a ~ 4.567. 

The true target distribution is shown in Fig. 4a as a solid curve. The first term in Eq. (20) 
is the prior density for m. The second term represents the likelihood term in the Bayesian 
inverse problem. Because of the nonlinearity of the observation operator g(-), the posterior 
is bimodal. An optimal value of p was determined by trial and error, but in fact, the MH 
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(a) Target joint density (7 = (b) RML proposal density (c) Compare joint proposal to 
0.005). (p = 0.995). target. 

Figure 3: Comparison of RML proposal density with the target density for Example 1 
(bimodal) sampling problem. Model variable (horizontal axis), data variable (vertical). 


acceptance rate is not highly sensitive to the value of p. In the interval 0.5 < p < 0.8 the 
acceptance rate for RML proposals is between 62% and 64 %. with a maximum acceptance 
rate of almost 0.64 when p = 0.65, which was the value we used. For the choice p = 0.65 


and 7 = 0.01, the joint proposal density for RML (Fig. 3a) is similar to the posterior joint 
density of x and d (Fig. 3b). Differences can more easily be seen in Fig. 3c where the two 
pdfs are superposed. The main deficiency in the proposal density is a reduced probability 
of proposals in the region between the two peaks. 

Figure [4a] shows the marginal posterior distribution of samples obtained using RML 
(Algorithm^ compared with the target distribution (Eq. |20[) for a chain of length 40,000. 
Figure |4b| shows that the mixing of the chain is very good, as should be expected for an 
independence sampler with a high acceptance rate. 


0 . 12 - 






0 500 1000 1500 2000 


(a) Distribution of samples of model vari- (b) Markov chain for model variable, 
able. 


Figure 4: Metropolized RML sampling for Example 1 (bimodal pdf). 
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3.2 Example 2: Multimodal 

In this example, the prior distribution for the two model variables is independent standard 
normal. The observations are related to model variables as 


s(xi,x 2 ) 


sin(27rxi) 

sin(27rx2) 


The observations are d Q b s = (0.,0.) with variance of observation error equal 0.04. The 
posterior pdf for x is composed of many widely separated peaks making random walk 
methods relatively inefficient (Fig. [5]). 

4 

2 

0 

-2 

-4 

(a) Posterior pdf for model variables (b) Posterior pdf for model variables condi¬ 
tional to x 2 = 0 



Figure 5: True posterior pdf for a problem with many modes. 


The Metropolized RML method (Algorithm [2]) was used for sampling with parameters 
p = 0.995 and 7 = 0.005. Figure [ 6 a] shows the first 40000 elements of the Markov chain and 
Fig. | 6 b| shows the distribution of samples from the same Markov chain in a more restricted 
region. Contours of the true pdf are plotted on top of the samples for comparison. For 
this particular example, the acceptance rate was essentially independent of p over a large 
range, varying between 87.1% for p = 0.02 to 87.4% for p = 0.995. 


4 . 0 , 6 .. 



(a) All samples from MCMC length 40,000. (b) Samples from MCMC in the neighbor¬ 
hood of the origin. 


Figure 6 : True posterior pdf for a problem with many modes. 


Because the acceptance rate for independent proposals is so high in this example, it 
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should be expected that the proposal density for the model variables is quite similar to the 
target density. Fig. [7] compares a section of the estimated pdf from the Metropolized RML 
to the true pdf (top) and a section of the estimated pdf from RML without an acceptance 
test to the true pdf (bottom). The estimated pdf was obtained using a Gaussian smoothing 
kernel with bandwidth 0.01. The differences between the two sampled distributions are 
clearly quite subtle. 




Figure 7: Compare estimated pdf (black contours) to the true pdf (red contours) for 
Metropolized RML (top) and RML without an acceptance test (bottom). Both Markov 
chains were of length 40,000 but only about 4000 samples are within this window. 


3.2.1 Computational cost 

Although the acceptance rate for independent samples in Metropolized RML method is 
quite high, the cost of generating proposals is substantially higher than the cost of a stan¬ 
dard MCMC method. Efficient implementation generally requires estimation of gradients 
of the cost function and a reasonable choice for the initialization of the minimization. The 
computational effort to generate a single proposal in RML for this numerical example is ap¬ 
proximately 15 function evaluations using a modified Levenberg-Marquardt minimization 
with finite difference estimation of gradients (MINPACK subroutine LMDIF) using x uc 
to initialize the minimization. In addition, computation of the Jacobian of the mapping 
requires an additional 5 function evaluations, hence the cost is approximately 20 function 
evaluations per element of the Markov chain. The acceptance rate for MH is 0.873 so the 
cost is approximately 20/0.873 ~ 23 functions evaluations per independent sample from 
the target distribution. 

For comparison, we also applied the preconditioned Crank Nicholson method (pCN) 
|10| to this example. The transition kernel for pCN is of the form 

= ^1-02 u (k) + p £(*) ; f (*) ~ N{ 0, C) 
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where (3 is a tunable parameter of the algorithm and is the state of the Markov chain 
at step k. If one uses (3 = 0.055 for an acceptance rate of 0.234 as recommended, a Markov 
chain of length 2 x 10 6 does not mix outside of a single mode. As an alternative, we selected 
an optimal value of (3 based on an estimate of effective sample size using the algorithm 
in Hoffman and Gelman [18] with 5 independent Markov chains. For this problem with 
multiple widely separated modes, it was determined that (3 = 1 was optimal. Note that 
in this case pCN is an independence sampler. The acceptance rate for the pCN method 
was 0.022, which is equivalent to 46 function evaluations per independent sample, thus 
Metropolized RML is approximately twice as efficient. 

If the standard deviation of noise in the data is reduced from 0.2 to 0.1, the modes 
of the posterior pdf become narrower and more widely separated. The acceptance rate 
for the pCN method falls to 0.0057 which is equivalent to 177 function evaluations per 
independent sample. The acceptance rate for Metropolized RML is slightly higher (0.886) 
than when an = 0.2 but the average number of function evaluations per iteration (15 + 5) 
is almost unchanged from that case. Thus, although the modes of the target pdf are more 
widely separated, the cost of using Metropolized RML is still approximately 20/0.886 ~ 23 
function evaluations per independent sample and the RML method becomes approximately 
8.7 times as efficient as the pCN method. 

For applications of RML to large-scale subsurface flow problems with many data, it 
has been reported that although Newton-like methods typically converge quickly, the com¬ 
putational cost for estimation of the Hessian at each iteration makes Newton-like methods 
inefficient compared to methods that only require computation of the gradient. In a com¬ 
parative study, Zhang and Reynolds [3Q] found the limited memory BFGS minimizer to 
be more efficient than other methods, including Gauss-Newton, Modified Levenberg- Mar- 
quardt, preconditioned conjugate gradient, or BFGS. In all cases, the sample from the 
prior, x uc , was used as the initial guess for minimization. 

3.3 Example 3: Nongaussian prior 

In this example, we look at a problem for which the prior distribution is not Gaussian, 
but for which a transformation to Gaussian variable can be introduced after which the 
previous methodology can be applied. The prior distribution for the model variable is 
exponential with mean and standard deviation equal to 1. A single measurement is made 
of (g(x) = x + e) with Gaussian observation error e ~ IV[0,0.36]. The observed value is 
dobs = I- The prior and posterior distributions for the model variable are shown in Fig. [8} 

Since the prior distribution in this example is not Gaussian, we define a new Gaussian 
variable z related to x as z = F~ x (. F x (x )), where F x and F z are the cdfs for x and z 
respectively. After transformation of the model variable, the joint prior distribution for 
model and data, the joint posterior distribution, and the joint proposal distribution are 
shown in Fig. [9j The joint proposal density that results from minimization with p = 0.25 
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X X 


Figure 8 : Prior and posterior model variable distributions for Example 3. 


is similar to the joint posterior density with 7 = 0 . 01 . 



Figure 9: Joint prior, posterior, and proposal distributions for Gaussian model variable 
and data variable. Model variable on horizontal axis. 


An inverse transform must be applied to each sample from the Markov chain to obtain 
samples in the original model space. Fig. |10a shows that the distribution of Monte Carlo 
samples is indistinguishable from the target posterior distribution. When this proposal 
density is used in the RML independence Meteropolis sampler (Algorithm [ 2 ]) the accep¬ 
tance rate is 74%. Because the proposals are independent and the acceptance rate is high, 
the mixing in the chain is very good. Figure shows the first 2000 elements of the 
chain, with no evidence of a burn-in period or of nonstationary behavior. Recall that the 
candidates are first generated from the joint prior, then a minimization is performed to 


place the proposals in regions of high probability. Fig. 10c shows the mapping of the first 
200 accepted candidates in the Markov chain from the joint prior to the proposal distri¬ 
bution. The weighting has already been used for this figure so that the 26% unsuccessful 
candidates are not shown. 


4 Discussion 

In this paper, we proposed an augmented variable independence Metropolis sampler that 
uses minimization to place proposals in regions of high probability. The computation 


17 






















(a) Distribution of samples (b) First 2000 elements of the (c) Mapping of variables from 
from Algorithm 2 (histogram) Markov chain for the model the prior to the posterior, 
compared with posterior dis- variable, 
tribution (solid curve). 

Figure 10: Summary plots for Example 3 in the original variable with an exponential prior 
distribution. 

required for a single proposal is considerably higher than the computation required for a 
typical MH algorithm, but because the proposals are independent and the acceptance rate 
is high, the cost per independent sample from the target distribution can be low compared 
to alternative methods. The biggest limitation for application of the method to geoscience 
inverse problems will be the computation of the Jacobian determinant of the mapping that 
generates proposals. 

The augmented variable RML method described in this paper contains two parameters 
p and 7 that must be tuned for efficiency and for validity of sampling. For linear obser¬ 
vations, the proposal density and the target density can be made close by setting both 
parameters to take small values. For multimodal posterior pdfs, however, the proposal 
density g(x*,d*) ^ 0 for all 7r(x*,d*) > 0 for arbitrary choices of p and 7 so the Markov 
chain may not converge to 7 r(-, •) unless /) < 1 and 7 ~ 1. The acceptance rate for the 
numerical examples has relatively insensitive to the the value of 7 over a fairly large range. 
Unlike typical applications of Metropolis random walk samplers, which often get stuck for 
long periods in a single mode of a multimodal distribution, the RML Metropolis sampler 
is more likely to remain in the regions between modes as the proposal density tends to 
undersample from those regions. 

For cases in which the prior is not Gaussian, it can sometimes be possible to use 
anamorphosis to transform the variables so that the prior is Gaussian, in which cases the 
algorithm can be used for sampling. This is straightforward for single variable problems, 
but finding a transformation to multivariate Gaussian in high dimensions seems impracti¬ 
cal. 
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